Software Engineer, LLM Platform
A leading AI solutions company is seeking an experienced Software Engineer, LLM Platform to join their R&D team in Menlo Park. This is an exciting opportunity to work on cutting-edge large language model (LLM) technologies while contributing to mission-critical platforms used by enterprise customers. If you’re passionate about building scalable, reliable systems and want to be part of a dynamic, innovative team, this role could be a great fit for you.
This organization is focused on providing enterprises with the tools to create their own Expert AI. The company’s platform enables customers to train and deploy custom models on their own data, with an emphasis on enterprise-grade security, flexibility, and minimal hallucination. Their team is made up of engineers and researchers working on highly impactful technologies, and the company is backed by top-tier VCs and tech firms. The Software Engineer, LLM Platform will play a crucial role in the development and maintenance of these next-gen AI systems.
As a Software Engineer, LLM Platform, you’ll be responsible for the design, implementation, and maintenance of LLM platforms running on Kubernetes. You will work on complex distributed systems, debug challenging issues across Kubernetes clusters, and ensure that customer-facing systems operate seamlessly. This is a hands-on, problem-solving role that demands strong technical expertise and a proactive mindset. You will also collaborate with cross-functional teams, including ML engineers, app engineers, and product managers, to deliver high-quality platform features.
What We Can Offer You:
- Equity and benefits as part of the total compensation package
- Opportunity to work with top engineers on cutting-edge AI platforms
- Collaborative and inclusive team environment
- Access to state-of-the-art infrastructure and tools
- A fast-paced, mission-driven work culture focused on innovation and impact
Key Responsibilities:
- Design, implement, and maintain an LLM platform on Kubernetes, supporting LLM tuning and inference workloads
- Troubleshoot complex distributed system problems across Kubernetes environments, often without direct access
- Provide quick and effective responses to customer issues, ensuring a high level of satisfaction
- Manage internal GPU fleet and optimize cluster resources in the data center
- Collaborate with engineering and product teams to define and implement platform features
- Write clear, concise documentation to assist customers in using the product efficiently
The ideal candidate will have proficiency in Python (or a similar programming language) and familiarity with Kubernetes, distributed systems, and machine learning workflows. Experience with LLM training, inference, and retrieval-augmented generation (RAG) systems will be advantageous. If you have built or worked on platforms for large-scale AI applications, we would like to hear from you. A strong understanding of open-source tools, GPU management, and CI/CD pipelines will also help you thrive in this role.